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ABSTRACT 

We introduce a new adaptive and fully Bayesian grid-based method to model strong grav- 
itational lenses with extended images. The primary goal of this method is to quantify the level 
of luminous and dark-mass substructure in massive galaxies, through their effect on highly- 
magnified arcs and Einstein rings. The method is adaptive on the source plane, where a De- 
launay tessellation is defined according to the lens mapping of a regular grid onto the source 
plane. The Bayesian penalty function allows us to recover the best non-linear potential-model 
parameters and/or a grid-based potential correction and to objectively quantify the level of reg- 
ularization for both the source and the potential. In addition, we implement a Nested-Sampling 
technique to quantify the errors on all non-linear mass model parameters - marginalized over 
all source and regularization parameters - and allow an objective ranking of different potential 
models in terms of the marginalized evidence. In particular, we are interested in comparing 
very smooth lens mass models with ones that contain mass-substructures. The algorithm has 
been tested on a range of simulated data sets, created from a model of a realistic lens system. 
One of the lens systems is characterized by a smooth potential with a power-law density pro- 
file, twelve include a Navarro, Frenk and White (NFW) dark-matter substructure of different 
masses and at different positions and one contains two NFW dark substructures with the same 
mass but with different positions. Reconstruction of the source and of the lens potential for 
all of these systems shows the method is able, in a realistic scenario, to identify perturbations 
with masses > 1O 7 M when located on the Einstein ring. For positions both inside and out- 
side of the ring, masses of at least 1O 9 M are required (i.e. roughly the Einstein ring of the 
perturber needs to overlap with that of the main lens). Our method provides a fully novel and 
objective test of mass substructure in massive galaxies. 
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1 INTRODUCTION 

At the present time, the most popular cosmological model for struc- 
ture formation is the ACDM paradigm. While this model has been 
very successful in describing the Universe on large scales and 
in reproducing numerous observational results (e.g., Reiss et all 
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Spergel et alj|2003l : iKomatsu et al.ll2008f) , important discrepancies 
still persist on small scales. In particular, some of the se involve the 
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and the number of galaxy satellites, i.e the Missing Satellite Prob- 
lem. 

According to the standard scenario, structures form in a hier- 
archical fashion via merging and accretion of smaller objects 
dToomrel 1 19771: iFrenk et alj 1 19881: IWhite & Frenkl Il99ll : iBarnesI 
Il992l : ICole et alj|2000l) . As shown by the latest numerical sim- 
ulations, in which high mass and force resolution is achieved, 
the progenitor population is only weakly affected by virializa- 
tion processes and a large number of sub-haloes is able to sur- 
vive after merging. The number of substructures within the Lo- 
cal Group, however, is predicted to be 1-2 ord ers of magnitude 



highe r than what is effe c tively observed (e .g., Kauffmann et alj 



1993; 
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Two different classes of solutions have been suggested to alle- 
viate this problem, cosmological and astrophysical. Cosmolog- 
ical solutions address the basis of the ACDM paradigm itself 
and mostly concentrate on the properties of the dark matter, al- 
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lowing for example, for a warm dColin et al ]|2QQd) , decaying 
' Cenl200ll) . self- interacting dSpergel & Steinhardtl2000)). repulsiv e 
Goodm an 2000), or annihilating nature i Riotto & Tkachevll2000h . 



Alternatively the ACDM picture can be modified by the introduc- 
tion of a break of the power-spectrum at the small sc ales (e.g., 
iKamionkowski & Liddlell2000l : IZentner & Bullockll2003h . 
From an astrophysical point of view, the number of visible satel- 
lites can be reduced by suppressing the gas collapse/cooling ( e.g., 
iBullock etai1l200d : iKravtsov et all |2004 iMoore et all l2006h via 
supernova feedback, photoionization or reionization. This would 
result in a high mass-to-light ratio (M/L) in the substructures. If 
these high-M/L substructures indeed exist, different methods for 
indirect detection are possible. The dark substructure may be de- 
tectable for example through its eff ects on stellar streams (e.g., 
Ibat a et al. I l2002l; May er et a Jl2002h. via y-rays from dark mat- 
ter annihilation (jBergstrom et al Jl999l:ICalcaneo-Roldan & Moord 



l2000l : IStoehr et alj 120031 ; IColafrancesco et al.1 I2006T) or through 
gravi tational lensing (e.g.. iDalal & Kochanek! 120021 ; IKoopmansI 
l2005h . 

While the first two approaches are limited to the local Universe, 
gravitational lensing allows one to explore the mass distribution 
of galaxies outside the Local Group and at a relatively high 
redshift. Moreover, gravitational lensing is independent of the 
baryonic content, of the dynamical state of the system and of 
the nature of dark matter. For example, when in a lens system a 
point source is close to the caustic fold or cusp, the sum of the 
image fluxes should add to zero if the sign of t he im a ge parities 
is taken into account jBlandford & Naravanl 1 19861 ; (zakharovl 
1 19951) . This relation is, however, violated by many observed 
lensed quasars with cus p and fold images. As first suggested by 
iMao & Schneideij dl998l) . these flux ratio anomalies can be related 
to the presence of (dark matter) substructure aroun d the lensing 
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Keeton et al.l l2005h . Nevertheless subsequent studies of 



similar gravitationally lensed systems have shown that the required 
mass fraction in subst ructure is higher than what is obtained in 
numerical simulations dMao et alj|2004l ; iMaccio & Mirandall2006l : 
iDiemand et alj 12007 al) . In addition, for a significant number of 



into account the luminous dwarf satellite population dTrotter et al. 
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Whether the mass fraction of CDM substructures is quantifiable 
via flux ratio anoma lies is therefore a question still open for 
debate. Alternatively, IKoopmansI d2005h showed that dark matter 
substructure in lensing galaxies can be detected by modelling of 
multiple images or Einstein rings from extended sources. 

In this paper, we developed an adaptive grid-based modelling 
code for extended lensed sources and grid-based potentials, 
to fully quantify this procedure. The method presented here 
is a significan t impr o vement of the techn i ques introduced b y 
Warren & Dye 1 2003 ). Dye & Warrenl d2005h . IKoopmansI d2005h. 



Suvu & Blandfordl d2006l) ISuvu et alj d2006h and Brewer & Lewis! 
1 20061) . [n order to detect mass substructure in lens galaxies 



one needs to solve simultaneously for the source surface bright- 
ness distribution and the lens potential. A semilinear technique 
for the reconstruction of grid-based sou rces, given a paramet - 
ric lens potential, was first introduced by Warren & Dyd d200"3h . 
The method was subsequently extended by Koopmans] 1 20051) and 



ISuvu& Blandfordl d2006l) in order to include a gr id-based potential 
for the len s and bvlBarnabe & Koopmans! d2007l) to include galaxy 
dynamics. iDve & Warrenl d2005l) introduced an adaptive gridding 
on the source plane; this would minimize the covariance between 
pixels and decrease the computational effort. However the method 
is still lac king an objective p roce dure to quantify the leve l of regu- 
larization. ISuvu et al.1 d2006l) and lBrewer & Lewisl d2006l) encoded 
the semi-linear me t hod w ithin the framework of Bayesian statis- 
tics dMacKav 1992, 2003;). Although a vast improvement, the fixed 
grids do not allow to take into account the correct number of de- 
grees of freedom and proper evidence comparison is difficult. In 
the implementation here described, these issues have been solved: 

(i) the procedure is fully Bayesian; this allows us to determine the 
best set of non-linear parameters for a given potential and the linear 
parameters of the source, to objectively set the level of regulariza- 
tion and to compare/rank different potential families; 

(ii) using a Delaunay tessellation, the source grid automatically 
adaptives in such a way that the computational effort is mostly con- 
centrated in high magnification regions; 

(iii) the source-grid triangles are re-computed at every step of the 
modelling so that the source and the image plane always perfectly 
map onto each other and the number of degrees of freedom remains 
constant during Bayesian evidence maximisation. 

For the first time in the framework of grid- based lensing m odelling, 
we use the Nested-Sampling technique bv lSkillingl d2004l) to com- 
pute the ful l marginalized Bayesian evidence of the data dMacKavl 
Il992l l2003h . This approach not only provides statistical errors on 
the lens parameters, but also consistently quantifies the relative ev- 
idence of a smooth potential against one containing substructures. 
As such, our method provides a fully objective way to rank these 
two hypotheses given the data, which is the goal set out in this pa- 
per. 

The paper is organized as follow. In Section 2 we give a general 
overview on the data model. In Section 3 we present in detail how 
the data model can be inverted and the source and lens potential 
reconstructed. In Section 4 we review the basics of Bayesian statis- 
tics and of the Nested-Sampling technique for evidence computa- 
tion. In Section 5 we describe how the method has been tested and 
how its ability in detecting substructures, depending on the pertur- 
bation mass and position, has been studied. Finally in Section 6 
conclusions are drawn and future applications are discussed. 



2 CONSTRUCTION OF THE LENSING OPERATORS 

In this section, we describe the data model which relates the un- 
known source brightness distribution and lens potential to the 
known data of the lensed images. The aim is to put this proce- 
dure in a fully self-consistent mathematical framework, excluding 
as much as possible any subjective intervention into the modelling. 
The core of the method presented here is based on a Occam's ra- 
zor argument. From a Bayesian evidence point of view, correlated 
features in the lensed images are most likely due to structure in the 
source, rather than being the result of small-scale perturbations of 
the lens potential in front of all the lensed images. On the other 
hand, uncorrelated structure in the lensed images is most likely due 
to small-scale perturbations of the lens potential. 
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2.1 The data, source and potential grids 

The main idea of grid-based lensing techniques is to use a grid- 
based reconstruction of the source and of the lens potential. Here 
we introduce the general geometry of the problem, explicitly shown 
in Fig. Q] Consider a lensed image d of an unknown extended 
source s. Both d and s are vectors that describe the surface bright- 
ness distributions on a set of spat ial points xj and y s . in the lens and 
source plane, respectiv ely (e.g.. IWarren & Dye] [2003; Ko opmarisl 
l2005t ISuvu eTai]l2006h . In general, these are related through the 
lens equation y d =xf- ^ip{x d ), where x'j corresponds to the spatial 
position of the surface brightness in the ith element of the vector d, 
i.e. dj and if/(xf) is the lensing potential, which is described in more 
detail in a moment. We note that y d does not necessarily directly 
correspond to the elements y s ., jth brightness value of the vector s. 
In our implementation, the grid on the source plane is fully adap- 
tive and is directly constructed from a subset of the N d pixels in the 
image plane, with spatial boundaries of the image grid included. 
In particular, as shown schematically in Fig. Q] N s pixels, located 
each at a position xj on the image grid, are cast back to the source 
plane giving the positions y s .. The set of positions [y'} constitute 
the vertices of a Delaunay triangulation. In this way, we define an 
irregular adaptive grid, where vertex positions in the source plane 
are related to positions on the image plane via the lens equation and 
every vertex value represents an unknown source surface brightness 
level. 

We assume the lens potential to be the superposition of a paramet- 
ric smooth component with linear local perturbations related to the 
presence of e.g. CDM substructures or dwarf galaxies: 



ifi(x, if) = ip s {x, rj) + Sf(x). 



(1) 



While ift s (x,i]) assumes a parametric form, with parameters i], 
Sij/(x) is a function that is pixelized on a regular Cartesian grid of 
points xf 1 with values Sif/k. The set {Sif/t) is written as a vector Sift. 
Given the observational set of data d, we now wish to recover the 
source distribution s and the lens potential if/(x, if) simultaneously. 
To do this we need to mathematically relate the brightness values d 
to the unknown brightness values s. As described in the next sub- 
section, this can be done through a linear operation on s and Siff, 
where the operator itself is a function of an initial guess of the lens 
potential. 

2.2 The source and potential operator 

We now derive the explicit relation between the unknown source 
distribution s, the potential correction Sift, the smooth potential 
if/ s (x, ij) and the image brightness d. 

Consider a generic triangle ABC on the source plane (Fig. |2(aj) . 
then the source surface brightness sp on a point P, located inside the 
triangle at the position y'j,, can be related to the surface brightness 
on the vertices A, B and C through a simple linear relation 



Sp = W\S\ + WbSb + w c s c ■ 



(2) 



An explicit expression for the bilinear interpolation weights w A , We 
and w c can be obtained by considering the point Pj , at the intersec- 
tion of the line AP with the line CB. The source intensities at P and 
Pi are also related to each other through a linear interpolation. On 
the other hand, the surface brightness in Pi is directly related to the 
values on the triangle vertices B and C 



Sp = ^7 ( *' " Sa) + Sa 



(3) 
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Figure 1. A schematic overview of the non-linear source and potential re- 
construction method, as implemented in this paper. On the left hand-side, 
on the image plane, two grids are defined: one for the potential corrections 
and one for the lensed image. A subset of N s of the Nj image pixels lo- 
cated at the positions x s on the image plane (filled circles) is cast back to 
the source plane (on the right) on y* through the lens equation. These form 
the vertices of an adaptive grid on the source plane. The remaining image 
pixels (open circles) are also cast to the source plane to the positions y d (we 
note that this set of points includes gfT), Because the source brightness distri- 
bution is conserved, i.e S{x d ) = S{y d ), the surface brightness at the empty 
circles is represented by a linear superposition of the surface brightness at 
the three triangle vertices that enclose it. Similarly the potential correction 
at a point x. is given by linear interpolation of the potential corrections at 
the suiTounding pixels (large rectangular pixels on the image plane). 



where dp A and d P[A are the absolute distances between the points 
P and A and the points Pi and A respectively; rf PlB and d C B are 
the distances between the points Pi and B and the points C and B 
respectively. Solving (3), we obtain the weights 



w A = 1 



^PA 
dp. A 



Wb 



dpA / 1 _ "V \ 

dp, A \ <>CB j 



(4) 



dpA^Pj B 
^P,A f/ CB 



with Y,i=Ahc w i = 1- Because gravitational lensing conserves the 
surface brightness, i.e. S(xf) = S(yf), the mapping between the 
two planes (when Sip = 0) can be expressed as a system of N s 
coupled linear equations 



BL(7/)s = d + n, 



(5) 



spectively (see e.) 


j. Warren & Dve 2003; Treu & KooDmansll2004l; 


lKoonmansll2005l; 


Suvu & Blandford 200fj|). The blurring operator 



is a square sparse matrix which accounts for the effects of the 
PSF Each row of the lensing operator (a sparse matrix) contains at 
most the three bilinear interpolation weights, ioa,b,c, placed at the 
columns that correspond to the three source vertices that enclose 
the associated source position. For a vertex point, there is only one 
weight equal to unity. In case N s = Nd (i.e. all image positions are 
used to create the source grid), all weights are equal to unity. In 
this case, the systems of equations is under-constrained and strong 
regularization is required. 

By pixelating 6if/(x) on a regular Cartesian grid, a similar argument 
as for the source can be applied to the potential correction; all po- 
tential values, [Si//);}, and their derivatives on the image plane can 
be r elated to this limited set of points th rough bilinear interpolation 
(see lKoopma ns 2005; Suvu et al.l2008l) . It is then possible to derive 
from equation ((5) a new set of linear equations, 



M c {q, if/) r = d + n, 



(6) 
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where 



s 

Sip 



(7) 



More specifically, ifi is the sum of all the previous corrections Sifr 
and the operator M c is a block matrix reading 



M, 



£B[L(j7,^)|-D,(smp)D* 



(8) 



L(//, iff) is the lensing operator introduced above, D s (s MP ) is a sparse 
matrix whose entries depend on the surface brightness gradient of 
the previously-best source model at y'j and is a matrix that 
determines the g radient of Sifr at all corresponding points x^ (see 
Koopmans 2005, for details). The generic structure of these matri- 
ces is given by 



D s 



as bf!) 
am 



8m 8ij 2 



(9) 




P2 > 
e — 
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/ pi 
— e 










Figure 2. Generic triangles from the source grid. Both the source surface 
brightness and its derivatives at the points P, Pi and P2 are given by linear 
superposition of the values at the edges of the surrounding triangles. 



and 



D 



Si/i 



dx\ 
3x2 



(10) 



amx d M ) 

ax\ 
amx d M ) 

~ 2 



where the index i runs along all the and y^, i.e. triangle vertices 
included. The "functions" S and 5i]/ and their derivative can be de- 
rived through bilinear interpolation and finite differencing from s 
and Sifr, respectively. 

It is clear from the structure of these matrices that the first-order 
correction to the model, as a result of t?i/r , is equal to Sd , = 
-VSQff) ■ VSif/(x^) at every point x'! (see e.g. [Koopmans 20Q5|, f° r 
a derivation). 

As for the surface brightness itself, also the first derivatives for a 
generic point P on the source plane can be expressed as functions 
of the relative values on the triangle vertices A, B, C, yielding 



dsp 


ds A 


9sb 


ds c 


dyi 


oyi 


+ W B-T— 

oy\ 


+ w c — 

dyi 


dsp 


ds A 


ds B 


ds c 


dy 2 


dy 2 


+ W B-T— 

dy 2 


+ w c-r— 

dy 2 



(11) 



A, B, C these are given by -P- = —'— and 

' & J dyi 112 

(n , iif , n 2 ) is the unit-length surface normal 



For the generic vertex j 

■P- = - — , where N : 
vector at the vertex j and is defined as the average of the adjacent 
per-face normal vectors. For Sip and it s gradients, on a re ctangular 
grid with rectangular pixels, we follow lKoopmansI d2005h . 



3 INVERTING THE DATA MODEL 

As shown above, in both cases of solving for the source alone, 
or solving for the source plus a potential correction, a linear 



data model can be constructed. In this section, we give a general 
overview of how this set of linear equations can be (iteratively) 
solved. A more thorough Bayesian description and motivation can 
be found in Section 4. 



3.1 The penalty function 

Before we go into the details of the method, we first restate that 
for a given lens potential if/(x, if) and potential correction ifr n = 
YH=\ Sipj, on a grid, the source surface brightness vector 5 and the 
data vector d can be related through a linear (matrix) operator 



M c('7> l / , „-i> s n-i)''« =d + n, 



(12) 



now explicitly written with their dependencies on the source and 
potential and with 



(13) 



In this equation s„ is a model of the source brightness distribution at 
a given iteration n (we describe the iterative scheme momentarily). 
We assume the noise n to be Gaussian which is a good approxima- 
tion for the HST images the method will be applied to. Even in case 
of deviations from Gaussianity, the central limit theorem, for many 
data points, ensures that the probability density distribution is often 
well approximated by a Normal distribution. 
Because of the ill-posed nature of this relation, equation d!2t can- 
not simply be inverted. Instead a penalty function which expresses 
the mismatch between the data and the model has to be defined by 



with 



■X 



+ <}?IIH.s||t + /li,.||H A 



(14) 



X' 



\MM fn-i.Sn-1 )r-df C, 1 [Mcto, )r-d]. (15) 



The second and third term in the penalty function contain prior in- 
formation, or beliefs about the smoothness of the source and of 
the potential respectively and d is the diagonal covariance ma- 
trix of the data. The level of regularization is set by the regulariza- 
tion parameters A, one for the source and one for the potential (see 
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lKoopmans| [2005: Suv u et aljExxSL for a more general discussion). 
In a Bayesian framework, this penalty function is related to the pos- 
terior probability of the model given the data (see Section 4). In the 
following two sections we describe how to solve for the linear and 
non-linear parameters of the penalty function (except for A, which 
is described in Section 4). 

3.1.1 Solving for the linear parameters 

The most probable solution, r MP , minimizing the penalty function 
is obtained by solving the set of linear equations 



(Mj C^M C + R T R) r = MjC^d. 
The regularization matrix is given by 
/i;HjH s 



R R 



(16) 



(17) 



The solution of this symmetric positive definite set of equations can 
be found using e.g. a Cholesky decomposition technique. By solv- 
ing equation i\6) . adding the correction 5ip n to the previously-best 
potential ifr n _ l and iterating this procedure, both the source and the 
potential should converge to the minimum of the penalty function 
P(s„, Sip n 1 1], A, s n -i, <P„-\ )• At every step of this iterative procedure 
the matrices M c and R have to be recalculated for the new updated 
potential ip n and source s„. While the potential grid points are kept 
spatially fixed in the image plane, the Delaunay tessellation grid of 
the source is re-built at every iteration to ensure that the number of 
degrees of freedom is kept constant during the entire optimization 
process. 

Note that because the source and the potential corrections are inde- 
pendent, they require their own form (H) and level (A) of regular- 
ization. The most common forms of regularization are the zeroth- 
order, the gradient and the curvature. As shown by ISuvu et all 
d2006h the best form depends on the nature of the source distri- 
bution and can be assessed via Bayesian evidence maximisation. 
For the source, we chose the curvature regularization defined for 
the Delaunay tessellation of the source plane. 
Specifically one can combine the gradient and curvature matrices 
in the x and ij directions: HjH s = Hj yi H syi +Hj y? H sy , . Both H syi 
and H s , yj can be obtained by analogy by considering the pair of 
triangles in Fig. |2(b)| and Fig. |2(c)| respectively. 
For every generic point C on the source plane we consider the pair 
of triangles ABC and DCE and define the curvature in C in the y t 
direction as: 

1 . .1 



-(Sp - s c ) ■ 



ICQ 



-(.SC ~ Sq). 



(18) 



This is not the second derivative, but we find that this alternative 
curvature definition gives much better results than using the second 
derivative directly. The reason is that it gives equal weight to all 
triangles, independently of their relative sizes (for identical rectan- 
gular pixels this problem does not arise since the above definition 
is equal to the second derivative up to a proportionality constant). 
A much smoother solution in that case is obtained. 
P and Q are given by intersecting the line CPi with the line ED and 
the line CP 2 with the line AB respectively. Specifically, Pj and P 2 
are defined as very small displacements from the point C in the y t 
direction 



c 



= yC±S y] . (19) 
The source surface brightness in P and Q can be obtained by linear 



interpolation between the source values in D with the value in E 
and the value in A with the value in B respectively 



s? 



-(s B - s D ) + s D 



■IQA 



sq = -f—(s B -s A ) + s A , 

tfAB 

Substituting d20b in dT8l > gives 



(20) 



"Cm 



1 1 \ d m 

H Sc H S E + 

"CP «CQ / dcpdj)E 



■'OA 



S B + 



PE "QB 

s d + —, ; — s A . 



(21) 



rfcQ^AB d C pd 0E rfcQ^AB 

Each row of the regularization matrix H syi , corresponding to ev- 
ery point C, contains the five interpolation weights, placed at the 
columns that correspond to the five vertices A, B, C, D and E. The 
curvature in the 1/2 direction is derived in an analogo us way us- 
ing th e pair of triangles in Fig. |2(c)| We refer again to iKoopmansI 
J2005h for details on the potential regularization matrix 



3.1.2 Solving for the non-linear parameters 

In order to recover the non-linear parameters ij, we need to mini- 
mize the penalty function P(s, r\ \ A, if/). We allow for a correction, 
iff, to the parametric potential if/(j], x) (not necessarily zero), but do 
not allow it to be changed while optimising for s and r\. In all cases, 
we keep A fixed during the optimization. Given an initial guess for 
the non-linear parameters rj Q , we then minimize the penalty func- 
tion defined in Section [3.1.1l under the conditions outlined above 
(iff is constant and Sift = 0). We use a non-linear o ptimizer (in 
our c ase Downhill-Simplex with Simulated Annealing; IPress et al.l 
1 19921) , to change r\ at every step and to minimize the joint penalty 
function P(s, t] | A, iff ). The optimization of s is implicitly embedded 
in the optimization of t] by solving equation l |16t only for s, every 
time i] is modified. 



3.2 The optimization strategy 

We have implemented a multi-fold optimization scheme for solving 
the linear equation dl2t . This scheme is not unique, but stabilises 
the numerical optimization of this rather complex set of equations. 
Solving all parameters simultaneously would be computationally 
prohibitive and usually shows poor convergence properties. 



3.2.1 Optimization steps 

Our optimization scheme is similar to a line-search optimization, 
where consecutively different sets of unknown parameters are being 
kept fixed, while the others are optimized for. The sets {Sip, s}, [rj, s} 
and {A, s] define the three different groups of parameters, of which 
only one is solved for at once. The individual steps, in no particular 
order, are then: 

(i) We assume r\ and A to be constant vectors and iteratively solve 
for Sip and the source s. In this case, at every iteration we solve for 
r and adjust ip, using the linear correction to the potential Sip. This 
was described in Section [3.1.1l 

(ii) We assume ip and A to be constant vectors and Sip t = at every 
iteration and only solve for the non-linear potential parameters r\ 
and the source s. This was described in Section [3.1.2l We note that 
part of step (i) is also implicitly carried out in step (ii) (i.e. solving 
for s). 
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(iii) We assume both (i) and (ii), above, and solve for the regu- 
larization parameters A s of the source and the source itself s. This 
requires a Bayesian approach and will be described in more detail 
in Section 4. We have not attempted to optimize for Ag^, but will 
study this in future publications. 

The overall goal, however, remains to solve for the full set of un- 
known parameters (;/, i//„,s„} for n — > <x> (or some large number). 
In particular if an overall smooth (on scales of the image separa- 
tions) potential model tf/(i]) does not allow a proper reconstruction 
of the lens system, we add an additional and more flexible potential 
correction Stfi, which can describe a more complex mass structure. 



3.2.2 Line-search optimization scheme 

In practice, we find that the optimal strategy to minimize the 
penalty function is the following, in order: 

(1) We set A s to a large constant value such that the source model re- 
mains relatively smooth throughout the optimization (i.e. the peak 
brightness of the model is a factor of a few below that of the data) 
and keep if/ n = (see also lSuvu et al.l 2006. 2008). We then solve 
for r\ and s that minimize the penalty function. 

(2) Once the best r\ and s are found, a Bayesian approach is used to 
find the best value of A s for the source only. At this point if/ is still 
kept equal to zero. 

(3) Given the new value of A s , step (1) is repeated to find improved 
values of r\ and s . Since the sensitivity of A s to changes in i] is rather 
weak, at this point the best values of r\, s and A have been found. 

(4) Next, all the above parameters are kept fixed and we solve for r, 
this time assuming a very large value for A 6l p to keep the potential 
correction (and convergence) smoo th. We adjust iff at every iteration 
until convergence is reached (e.g. ISuvu et al.ll2~008l) . At this point 
we stop the optimization procedure. 

(5) The smooth model with if/ = and the same model with iff + 
are then compared through their Bayesian evidence values and 
er rors on the para meters are estimated through the Nested Sampling 
of lSkillind d2004l) (Section 4). 

Fig.[3]shows a complete flow diagram of our optimization scheme. 
In the next section we place equation d 1 4b and model ranking on 
a formal Bayesian footing. Those readers mostly interested in the 
application and tests of the method could continue reading in Sec- 
tion 5. 



4 A BAYESIAN APPROACH TO DATA FITTING AND 
MODEL SELECTION 

When trying to constrain the physical properties of the lens galaxy, 
within the grid-based approach, three different problems are faced. 
Given the linear relation in equation ((6) we need to determine the 
linear parameters r for a certain set of data d and a form for the 
smooth potential tf/ s (x,t]). We then aim to find the best values for 
the parameters i] and A and finally, on a more general level, we wish 
to infer the best model for the overall potential and quantitatively 
rank different potential families. In particular, we want to compare 
smooth models with models that also include a potential grid for 
substructure (with more free parameters). These issues can all be 
quantitatively and objectively addressed within the framework of 
Bayesian statistics. In the cont ext of data modelling three levels of 
inference can be distinguished dMacKavll 19921 ; ISuvu et alj|2006l) . 

(1) First level of inference: linear optimization. We assume the 
model M c , which depends on a given potential and source model, 



to be true and for a fixed form R and level (A) of regularization, we 
derive from Bayes' theorem the following expression: 



P(r\d,A,ii,M c ,R) 



P(d | r, ij, Mc) P(r | A, R) 



(22) 



P(d\A, tj, M C ,R) 

The likelihood term, in case of Gaussian noise, for a covariance 
matrix Cd, is given by 



(23) 



(24) 



P(d | r, Tj, Mc) = ^ exp [-E d (d \ r, ij, M c )] 
where 

Z d = (27r) w " /2 (det C d ) 1/2 
and (see equation [l"5l 

E d (d I r, »7, M c ] = l -x 2 = \ W c r - df C D ' (M c r - d) . (25) 

Because of the presence of noise and often the singularity of 
det (MjM,.), it is not possible to simply invert the linear relation 
in equation l(6j but an additional penalty function must be defined 
through the introduction of a prior probability P(r \ A, R) on s and 
on Sifr. In our implementation of the method, the prior assumes 
a quadratic form, with minimum in r = and sets the level of 
smoothness (specified in H and A) for the solution 



P(r\A,R) = -exp[-AE r (r\R)], 



with 



Z r (A) = I dre~ m ' = e- M 'V> I ^ I (det C)~ 1/2 , 



In 



N,/2 



E r = i||Rr|g 
and 

C = VV£ r = RR T . 



(26) 



(27) 



(28) 



(29) 



The normalization constant P(d \ A, jj, M c , R) is called the evidence 
and plays an important role at higher levels of inference. In this 
specific case it reads 



P(d\ A, J7,M C ,R) 
where 

M(r) = E d + E r . 



j drexp (-M(r)) 



Z d Z r 



(30) 



(31) 



The most probable solution for the linear parameters, is found by 
maximizing the posterior probability 



P(r\d,A,i],Mc,K-) 



exp(-M(r)) 
f dr exp(-M(r)) 



(32) 



The condition d(E d + E, )jdr = now yields the set of linear equa- 
tions already introduced in Section [3.1.1| 

(M^Cf 'M c + R T R) r = MlC A -'d. (33) 

Equation {33} is solved iteratively using a Cholesky decomposition 
technique. 

(2) Second level of inference: non-linear optimization. At this 
level we want to infer the non-linear parameters rj and the hyper- 
parameter A s for the source. Since at this point we are interested 
only in the smooth component of the lens potential, we set Si// = 
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S 
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Optimise Tj + »J sub , A 



Fix Sip = 



Ss 



Figure 3. A schematic overview of the non-linear source and potential reconstruction method. 



and for a fixed family tA s (j/), form of the regularization R and model 
M c , we maximize the posterior probability 



P(A,J7|rf,M c ,R) 



P(d\A, ?7,M C ,R)PU, rj) 
P(d\M c ,K) 



(34) 



Assuming a prior P(A, rf), which is flat in log(/l,) and rj, reduces 
to maximizing the evidence P(d \ A, rj, M c , R) (which here plays the 
role of the likelihood) for r\ and A. The evidence can be computed 
by integrating over the posterior d34t 



P(d\A,ti,Mc,R) = drP(d\r,ti,Mc)P(r\A,K). 



(35) 



Because of the assumptions we made (Gaussian noise and quadratic 
form of regularization), this integral can be solved analytically and 
yields 



P(d\A,ti,Mc,K) 



where 



2 M (A, jj) 
Z d Z,-(A) ' 



Z M (A, rf) = exp (-M(r MP )) (2n) N ' jl (det A) 



-1/2 



(36) 



(37) 



with A = VVM(r). Again we proceed in an iterative fashion: using 
a simulated annealing technique we maximize the evidence J35l > for 
the parameters r\. Every step of the maximisation generates a new 
model M c (i^(j7 i )), for which the most probable source s MP is recon- 
structed as described in Section[3] At this starting step the level of 
the source regularization is set to a relatively large initial value A s q; 
in this way we ensure the solution to be smooth (at least at this first 
level) and the exploration of the // space to be faster. Subsequently 
we fix the best model M c (// ) found at the previous iteration and, 
using the same technique, we maximize the evidence for the source 
regularization level A s . The procedure is repeated until the total ev- 
idence has reached its maximum. In principle we should have built 
a nested loop for A s at every step of the 77 exploration, but in prac- 
tice the regularization constant only changes slightly with rj and 
the alternate loop described above gives a faster way to reach the 
maximum (line-search method). 



(3) At the third level of inference Bayesian statistics provides an ob- 
jective and quantitative procedure for model comparison and rank- 
ing on the basis of the evidence, 



P(M C , R I d) oc P(d I Mc, R)P(M C , R) . 



(38) 



For a flat prior P(M C , R) (at this level of inference we can make lit- 
tle to no assumptions) different models can be compared according 
to their value of P(d \ M c , R), which is related to the evidence of the 
previous level by the following relation 



I 



P{d I Mc, R) = dA di] P(d I A, r\, M c , R)P(A, //) . 



(39) 



Being multidimensional and highly non-linear, the integral ( 1391 ) 
is carried out numerically through a Nested-Sampling technique 
jSkillind[2004l) , which is described in more detail in the next sec- 
tion. A by-product of this method is an exploration of the posterior 
probability d34| >, allowing for error analysis of the non-linear pa- 
rameters and of the evidence itself. 



4.1 Model selection: smooth versus clumpy models 

In the previous section we introduced the main structure of the 
Bayesian inference for model fitting and model selection. While 
parameter fitting simply determines how well a model matches the 
data and can be easily attained with the relatively simple analytic 
integrations of the first and second level of inference, model selec- 
tion itself requires the highly non-linear and multidimensional in- 
tegral l |39b to be solved. This marginalized evidence can be used to 
assign probabilities to models and to reasonably establish whether 
the data require or allows additional parameters or not. Given two 
competing models Mq and Mi with relative marginalized evidence 
£ and 61, the Bayes factor, A£ = log£ _ l°g£i> quantifies 
how well Mo is supported by the data when compared with Mi 
and it automatically includes the Occam's razor. Typically the lit- 
erature sugges ts to weigh the Bayes factor using Jeffreys' scale 
jjeffrevs|[l9olT) , which however provides only a qualitative indi- 
cation: A£ < 1 is not significant, 1 < A£ < 2.5 is significant, 
2.5 < A£ < 5 is strong and A£ > 5 is decisive. 



© 2008 RAS, MNRAS 000,[T|-?? 



8 S. Vegetti & L. V. E. Koopmans. 



In order to evaluate this marginalized evidence with a high enough 
accuracy we implemented the n ew evidence al gorithm known as 
Nested Sampling, proposed by ISkillind <2004 . Specifically, we 
would like to compare two different models: one in which the lens 
potential is smooth and one in which substructures are present, with 
e.g. a NFW profile. While the first is defined by the non-linear pa- 
rameters of the lens potential and of the source regularization only, 
the second also allows for three extra parameters: the mass of the 
substructure and its position on the lens plane (see Section[5} 

4.2 Model ranking: nested sampling 

Here, we provide a short description of how the Nested Sampling 
can be used to compute the marginalized evidence and errors on 
the mo del parameters; a more detailed one can be found in Skill ins 
j2004h . The Nested-Sampling algorithm integrates the likelihood 
over the prior volume by moving through thin nested likelihood sur- 
faces. Introducing the fraction of total prior mass X, within which 
the likelihood exceeds X* , hence 



r= I dx, 

J£>£' 

with 

dX = P(A,r])dAdi], 

the multi-dimensional integral J39l > relating the likelihood X and 
the marginalized evidence £ can be reduced to a one-dimensional 
integral with positive and decreasing integrand 



(40) 



(41) 



i 

Jo 



dXX(X). 



(42) 



Where X(X) is the likelihood of the (possibly disjoint) iso- 
likelihood surface in parameter space which encloses a total prior 
mass of X. If the likelihood Xj — -C(Xj) can be evaluated for each of 
a given set of decreasing points, < X ) < Xj-\ < .... < 1, then the 
total evidence £ can be obtained, for example, with the trapezoid 
mlc,6 = Y l J =l 8 j = ZJ =l %(Xj„ 1 -Xj +1 ), 

The power of the method is that the values of Xj do not have to 
be explicitly calculated, but can be statistically estimated. Specifi- 
cally, the marginalized evidence is obtained through the following 
iterative scheme: 

(1) the likelihood X is computed for N different points, called active 
points, which are randomly drawn from the prior volume. 

(2) the point Xj with the lowest likelihood is found and the corre- 
sponding prior volume is estimated statistically: after j iterations 
the average volume decreases as Xj/Xj-j = t, where t is the ex- 
pectation value of the largest of N numbers uniformly distributed 
between (0, 1). 

(3) the term Sj = ^ (x M - X j+l ) is added to the current value of 
the total evidence; 

(4) Xj is replaced by a new point randomly distributed within the 
remaining prior volume and satisfying the condition X > X* = Xj', 

(5) the above steps are repeated until a stopping criterion is satis- 
fied. 

By climbing up the iso-likelihood surfaces, the method, in general, 
find and quantifies the small region in which the bulk of the evi- 
dence is located. 

Differ ent stopping criteria can be chosen. Following Skillina 
< |2004|) . we stop the iteration when j » NH, where H is mi- 
nus the logarithm of that fraction of prior mass which contains 
the bulk of the posterior mass. In practical terms this means that 
the procedure should be stopped only when most of the evidence 



has been included. Given the areas Sj, in fact, the likelihood ini- 
tially increases faster than the widths decrease, until its maximum 
is reached; across this maximum, located in the region £ =s e~ H , 
the likelihood flatten off and the decreasing widths dominate the 
increasing Xj. Since £, « e~ ;/N , it takes NH iterations to reach 
the dominating areas. These NH iterations are random and are sub- 
jected to a standard deviation uncertainty VNH, corresponding to 
a deviation standard on the logarithmic evidence of VNH/N 



log£ = log 



with cr log6 



(43) 



4.2.1 Posterior probability distributions 

For the lens parameters, the substructure position and the logarithm 
of the source regularization, priors are chosen to be uniform on a 
symmetric interval around the best values which we have deter- 
mined at the second level of the Bayesian inference. The size of 
the interval being at least one order of magnitude larger than the 
errors on the parameters. In practice, we first carry out a fast run 
of the Nested Sampling with few active points N, this gives us 
an estimate for the non-linear parameter errors. Using the product 
2 x Wdim x o~n, where Na m is the total number of parameters and cr, 
the corresponding standard deviation, we can then roughly enclose 
the bulk of the likelihood (note that this can be double-checked 
and corrected in hindsight, if the posterior probability functions 
are truncated at the prior boundaries). Priors on the parameters are 
taken in such a way that this maximum is fully included in the total 
integral of the marginalized evidence. For the main lens parame- 
ters and for the regularization constant the same priors are used for 
model with and without substructure. For the substructure mass a 
flat prior between M min = 4.0 x 10 6 M o and M max = 4.0 x 10 9 M o 
is adopted, with the t wo limits given by N-body simulations (e.g. 
iDiemand et al.l2007 al ibi). In reality, the method does not require the 
parameters to be well known a priori, but limiting the exploration to 
the best fit region sensibly reduces the computational effort without 
significantly altering the evidence estimation. From Bayes theorem 
we have that the posterior probability density pj is given by 



Pj(t) 



^-(X M -X J+l )/S(t): 



wj/6(t) . 



(44) 



The existing set of points (j/, X) lt ..., (i], A) N then gives us a set of 
posterior values that can be then used to obtain mean values and 
standard deviations on the non-linear parameters 



(45) 



and similarly for A. These samples also provide a sampling of the 
full joint probability density function. Marginalising over this func- 
tion, the full marginalized probability density distribution of each 
parameters can be determined (see Section 5.5). 



5 TESTING AND CALIBRATING THE METHOD 

In this section we describe the procedure to test the method intro- 
duced above and to assess its ability to detect dark matter substruc- 
tures in realistic data sets (e.g. from HST). A set of mock data, 
mimicking a typical Einstein ring, is created. We generate four- 
teen different lens models, of which Lo is purely smooth, Li^n 
are given by the superposition of the same smooth potential with 
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a single NFW dark matter substructure of varying mass and posi- 
tion and L13 contains two NFW dark matter substructures with the 
same mass but with different positions (See Table [TJ. A first ap- 
proximate reconstruction of the source and of the lens potential is 
performed by recovering the best non-linear lens parameters r\ and 
the level of source regularization A s . These values are then used 
for the linear grid-based optimization, which provides initial val- 
ues of the substructure position and mass. Three extra runs of the 
non-linear optimization are then performed to recover the best set 
(i] b ,A s j,) for the main lens and the best mass and position of the 
substructure (solely modelled with a NFW density profile). Finally 
by means of the Nested-Sampling technique described in Section 
14. 11 we compute the marginalized evidence, equation l |39K for ev- 
ery model twice, once under the hypothesis of a smooth lens and 
once allowing for the presence of one or two extra mass substruc- 
tures. Comparison between these two models allows us to assess 
whether the presence of substructure in the model improves the ev- 
idence despite the larger number of free parameters. 

5.1 Mock data realisations 

A set of simulated data with realistic noise is generated from a 
model based on the real lens SLACS J1627 -0055 dKoopmans et al.l 
l2006l : lBolton et al]|2006r , lTreu et alj|2006h . W e assume the le ns to 
be well described by a power-law (PL) profile dBarkanall 19981) . Us- 
ing the optimization technique described in Section l(4j we find 
the best set of non-linear parameters (i] b , A sb ). In particular 77 con- 
tains the lens strength b, and some of the lens-geometry parame- 
ters: the position angle 8, the axis ratio /, the centre coordinates 
Xq and the density profile slope q, (p cc r -( 2 </ +1 ^. if necessary, infor- 
mation about external shear can be included. The best parameters 
are used to create fourteen different lenses and their corresponding 
lensed images. One of the systems is given by a smooth PL model 
while twelve include a dark matter substructure with virial mass 
M vir = 10 7 M Q , 10 8 M o , 10 9 M o located either on the lowest surface 
brightness point of the ring P , on a high surface brightness point of 
the ring P\ , inside the ring Pj and outside the ring P3 (see Table[T](. 
The fourteenth lens contains two substructures each with a mass of 
M vir = 10 8 M G , located respectively in P and P\ . The substructures 
are assumed to have a NFW profile 

p(r)=p.(r,/r)[i + (r/r,)Y*, (46) 

where the concentration c = r vir /r, and the scaling radius r s are 
obtained from the virial mass usin g the empirical scaling laws pro- 
vided by iDiemand et al. I j2007iM . The source has an elliptical 
Gaussian surface brightness profile centred in zero 

s(y) = s exp [-( yi ISy{f - (y 2 /Sy 2 ) 2 ] ■ (47) 

We assume .so = 0.25, 6y\ = 0.01 and Sy 2 = 0.04. 

5.2 Non-linear reconstruction of the main lens 

We start by choosing an initial parameter set i] for the main lens, 
which is offset from 7/ tlllc that we used to create the simulated data. 
Assuming the lens does not contain any substructure we run the 
non-linear procedure described in Section © and optimize (//, A s ] 
for each of the considered systems. At every step of the optimiza- 
tion a new set (77,, /ij,,) is obtained and the corresponding lensing 
operator M c (j/,-, /!.,,,) has to be re-computed. The images are de- 
fined on a 81 by 81 pixels (Nj = 6561) regular Cartesian grid while 
the sources are reconstructed on a Delaunay tessellation grid of 



Table 1. Non-smooth (PL+NFW) lens models. At each of the P, positions 
a NFW perturbation of virial mass m sub is superimposed on a smooth PL 
mass model distribution. 



Lens 




*suh (arcsec) 


m mb (Mq) 


Li 


Po 


= (+0.90; +1.19) 


10 7 


L 2 






10 s 


L3 






LU 


u 


Pi 


= (-0.50; -1.00) 


H) 7 


u 






10 s 


u 






10 9 




P 2 


= (-0.10; -0.60) 


10 7 


u 






10 s 


u 






10 9 


L10 


P 3 


= (-0.90; -1.40) 


10 7 


Ln 






10 s 


L12 






10 9 


L13 




P and Pi 


10 8 



N s = 441 vertices. The number of image points, used for the source 
grid construction, is effectively a form of a prior and the marginal- 
ized evidence (equation|39t can be used to test this choice. To check 
whether the number of image pixels used can affect the result of 
our modelling, we consider the smooth lens Lo and perform the 
non-linear reconstruction using one pixel every sixteen, nine, four 
and one. In each of the considered cases we find that the lens pa- 
rameters are within the relative errors (see Table 3). This suggests 
that, for this particular case, the choice of number of pixels is not 
influencing the quality of the reconstruction. In real systems, the 
dynamic range of the lensed images could be much higher and a 
case by case choice based on the marginalized evidence has to be 
considered. In Fig. [5] the residuals relative to the system Li are 
shown; the noise level is in general reached and only small resid- 
uals are observed at the position of the substructure. Whether the 
level of such residuals and therefore the relative detection of the 
substructure are significant is an issue we will address later on in 
terms of the total marginalized evidence. 



5.3 Linear reconstruction: substructure detection 

The non-linear optimization provides us with a first good approx- 
imate solution for the source and for the smooth component of 
the lens potential. While this is a good description for the smooth 
model Lo (see Fig. [4j», the residuals (e.g. Fig. [7} for the perturbed 
model L iM indicate that the no-substructure hypothesis is improb- 
able and that perturbations to the main potential have to be consid- 
ered. If the perturbation is small, this can be done by temporarily 
assuming that 7/ i=1 reflects the true mass model distribution for the 
main lens and reconstruct the source and the potential correction by 
means of equation | |33I >. In order to keep the potential corrections 
in the linear regime, where the approximation l |33l > is valid, both 
the source and the potential need to be initially over-regularised: 
A s = 10A s ,i and A^ = 3.0 x 10 5 dKoopmans! 120051 : ISuvu et all 
1 2006]) . For each of the possible substructure positions we iden- 
tify the lowest-mass-substructure we are able to recover. In the two 
most favourable cases, L[ and L 4 , in which the substructure sits on 
the Einstein ring a perturbation of 10 7 M G is readily reconstructed. 
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For these two positions higher mass models, with the exception of 
L 2 , will not be further analysed. The systems L 7 89 and Lio,n,i2> 
in which the substructure is located, respectively, inside and out- 
side the ring, represent more difficult scenarios. In the first case all 
perturbations below 10 9 M o can be mimicked by an increase of the 
mass of the main lens within the ring, while in the second case these 
cannot be easily distinguished from an external shear effect. For the 
models Lj ,2,4,9,12 convergence is reached after 150 iterations and the 
perturbations are recovered near their known position (Figs. 8 and 
9). The grid based potential reconstruction indeed leads to a good 
first estimation for the substructure position. 

5.4 Non-linear reconstruction: main lens and substructure 

In order to compare with numerical simulations, the mass of the 
substructure is required. Performing this evaluation with a grid 
based reconstruction is more complicated and requires some as- 
sumptions (e.g. aperture). To alleviate this problem we assume a 
parametric model, in which the substructures are described by a 
NFW density profile, and we recover the corresponding non-linear 
parameters, mass and position, using the non-linear Bayesian opti- 
mization previously described. 

To quantify the mass and position of the substructure and to update 
the non-linear parameters when a substructure is added, we adopt 
a multi-step non-linear procedure that relatively fast converges to 
a best PL+NFW mass model. At this level, we neglect the smooth 
lens L , for which a satisfactory model already has been obtained 
after the first non-linear optimization, and the perturbed models 
1-7,8,10,11 for which the substructure could not be recovered. We pro- 
ceed as follows: 

(i) we fix the main lens parameters to the best values found in Sec- 
tion d5.3t , (»7i,/ls,i). We set the substructure mass to some guess 
value. We optimize for the substructure position x su b,i- 

(ii) we fix (//p/l.,,1 j and the source position x subl . We optimize for 
the substructure mass m sub ,i . 

(iii) we run the non-linear procedure described in Section © by 
alternatively optimising for the main lens, source, and substructure 
parameters and for the level of source regularization. 

This leads to a new set of parameters, (J7 b ,/l s .b,'?isub.b,-*sub,bl- Final 
results for the considered models are listed in Table 3 and the rel- 
ative residuals are shown in the Figs. 15171 respectively. For all the 
considered lenses the final reconstruction converges to the noise 
level. 



5.5 Multiple substructures 

The lens system L13 represents a more complex case in which two 
substructures are included. In particular we are interested in test- 
ing whether both substructures are detectable and whether their ef- 
fect may be hidden by the presence of external shear. As for the 
previously considered cases, we first perform a non-linear recon- 
struction of the main lens parameters assuming a single PL mass 
model. For this particular system we also include the strength r s h 
and the position angle # sh of the external shear as free parameters. 
Results for this first step of the reconstruction are shown in Fig. 
|10(a)| We then run the linear potential reconstruction. One of the 
two substructures is detected although a significant level of image 
residuals is left (Fig. II It. The combined effect of external shears 
(r sh = -0.031) and the substructure in Pi is not sufficient to ex- 
plain the perturbation generated by the second substructure at the 



lowest surface brightness point of the Einstein ring. We therefore 
include a NFW substructure in the recovered position and run a 
non-linear reconstruction for the new PL+NFW model, Fig, | 10(b)] 
We are then able to detect also the second substructure, Fig. [12] Fi- 
nally we run a global non-linear reconstruction for the PL+2NFW 
model (Fig. | 10(c)} , the noise level is reached and the strength of the 
external shear is consistent with zero (r sh = 0.0001). 

5.6 Nested sampling: the evidence for substructure 

When modelling systems as L or L^ we assume that the best 
recovered values, under the hypothesis of a single power-law, pro- 
vide a good description of the true mass distribution and that any 
eventually observed residual could be an indication for the presence 
of mass substructure. Model comparison within the framework of 
Bayesian statistics gives us the possibility to test this assumption. 

J. 6. 1 Marginalized Bayesian evidence 

In order to statistically compare two models the marginalized evi- 
dence \39\ has to be computed. As described in Section d4.il ) this 
multi-dimensional and non-linear integral can be evaluated using 
the Nested-Sampling technique by Skilling ( 2004). Specifically the 
two mass models we wish to compare are a single PL, M , versus 
a PL+NWF substructure, Mi. The first one is completely defined 
by the non-linear parameters (17, A s ), while the second needs three 
extra parameters, namely the substructure mass and position. For 
all these parameters prior probabilities have to be defined: 

{constant for - r/j\ < Srjj 
(48) 
for |/7 b .i -t]i\> Srji 

and 

I constant for |x sub , b ,i - x subii | < 6x mbi 
(49) 
for |* sl , b , bj - jc sub i | > 6x subi 

where the elements of Srjj and 6x sllb i are empirically asses sed such 
that t he bulk of the evidence likelihood is included (see Skill ins 
120041) . The prior on the substructure mass is flat between the 
lower and upper mass limits given by numerical simulations (e.g. 
iDiemand et alll2007 alibi) . Given the lenses Lo,i,2,4,9,i2,i3 we run the 
Nested Sampling twice, once for the single PL model and once 
for the PL+NFW (+NFW) one. The two marginalized evidences 
with corresponding numerical errors can be compared from Ta- 
ble 2. Despi te a certain nu mber of authors suggest the use of Jef- 
freys' scale jjeffrevslll96ll) for model comparison, we adopt here 
a more conservative criterion. In particular, we note that the per- 
turbed model Mi for the lens system Lo is basically consistent with 
a single smooth PL model M , with A£ ~ 7.85. The Bayesian fac- 
tor for the system L4 is of the order of A£ ~ 21.5 in favour of 
the smooth model Mo, indicating that the detection of such a low- 
mass substructure can formally not be claimed at a significant level. 
The reason why we think this substructure is clearly visible in the 
grid-based results, is that this particular solution is the maximum- 
posterior (MP) solution, whereas the evidence gives the integral 
over the entire parameter space. This implies that there must be 
many solutions near the MP solution that do not show the substruc- 
ture. This indicates that our approach of quantifying the evidence 
for substructure is very conservative. On the other hand the Bayes 
factor for the lens Li, Afi = - 17.1, clearly shows that the detection 
of a 1O 7 M substructure can be significant when the latter is located 
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Table 2. marginalized evidence and corresponding standard deviation as 
obtained via the Nested-Sampling integration. Results are shown for the 
hypothesis of a smooth lens (PL) and the hypothesis of a clumpy lens po- 
tential (PL+NFW). 



y 

2 

Image Residual 




Figure 4. Results of the non-linear optimization for the smooth lens Lo . The 
top-right panel shows the original mock data, while the top-left one shows 
the final reconstruction. On the second row the source reconstruction (left) 
and the image residuals (right) are shown. 



at a different position on the ring. Finally all higher mass perturba- 
tions are easily detectable independently of their position relative 
to the image ring; Bayes factor for L 2 , L 9 , L 12 and Li 3 are, in fact, 
respectively Afi = -213.0, AS = -2609.7, Afi = -4603.4 and 
Afi = -1835.7. Substructure properties for these systems are also 
confidently recovered. The main difference between Jeffreys' scale 
and our criterion for quantifying the significance level of the sub- 
structure detection is observed for the system L| . If we had to adopt 
Jeffreys' scale in fact, such detection would have to be claimed de- 
cisive while we think it is only significant. 

5.7 Posterior probabilities 

As discussed in Section ( 14.11 ) an interesting by-product of the 
Nested-Sampling procedure is an exploration of the posterior prob- 
ability ( 1341 ) which provides us with statistical errors on the model 
parameters, see Tables 3 and 4. The relative posterior probabili- 
ties for Lo, Li and L 2 are plotted in Fig. [T3] Fig. [14] and Fig. \T5\ 
respectively. Lets start by considering the lens system Lo and the 
relative probability distribution for the substructure mass. Although 
the model Mi , in terms of marginalized evidence, is consistent with 
the single smooth PL model Mo, there is a small probability for 
the presence of a substructure with mass up to few 10 8 M G located 
as far as possible from the ring. The effect of such objects on the 
lensed image would be very small and could be easily hidden by 
introducing artificial features in the source structure, as suggested 
by the posterior distributions for the source regularization constant. 
This means, that from the image point of view, a smooth single 
PL model and a perturbed PL+NWF with a substructure of 10 8 M o , 
located far from ring, are not distinguishable from each other as 
long as the effect of the perburber can be hidden in the structure 
of the source. From a probabilistic point of view, however, the sec- 
ond scenario is more unlikely to happen. A similar argument can be 
applied to the lens Lj for which a strong degeneracy between the 
mass and the position of the substructure is observed. We conclude 
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therefore that, although this substructure can be detected at a statis- 
tically significant level, its mass and position cannot be confidently 
assessed yet. In contrast, for systems such as L 2 _9_i 2 , the effect of the 
substructure is so strong that it can not be mimicked by the source 
structure or by a different combination of the substructure parame- 
ters. For these cases not only the detection is highly significant, but 
the properties of the perturber can be confidently constrained with 
minimal biases. 



6 CONCLUSIONS AND FUTURE WORK 

We have introduced a fully Bayesian adaptive method for objec- 
tively detecting mass substructure in gravitational lens galaxies. 
The implemented method has the following specific features: 

• Arbitrary imaging data-set defined on a regular grid can be 
modelled, as long as only lensed structure is included. The code is 
specifically tailored to high-resolution HST data-sets with a com- 
pact PSF that can be sampled by a small number of pixels. 

• Different parametric two-dimensional mass-models can be 
used, with a set of free parameter i]. Currently, we have im- 
pleme nted the elliptical power-law density models from iBarkanal 

but other models can easily be included. Multiple paramet- 
ric mass models can be simultaneously optimized. 

• A grid-based correction to the parametric potential can iter- 
atively be determined for any perturbation that can not easily be 
modelled within the chosen family of potential models (e.g. warps, 
twists, mass-substructures, etc.). 

• The source surface-brightness structure is determined on a 
fully adaptive Delaunay tessellation grid, which is updated with 
every change of the lens potential. 

• Both model-parameter optimization and model ranking are 
fully embedded in a Bayesian framework. The method takes special 
care not to change the number of degrees of freedom during the op- 
timization, which is ensured by the adaptive source grid. Methods 
with a fixed source surface-brightness grid can not do this. 

• Both source and potential solutions are regularised, based on 
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Figure 5. Left panel: Results of the first non-linear reconstruction for the smooth component of the perturbed lens L i . The top-right panel shows the original 
mock data, while the top-left one shows the final reconstruction. On the second row the source reconstruction (left) and the image residuals (right) are shown. 
Right panel: Final results of the non-linear reconstruction for the perturbed lens Li . The top-right panel shows the original mock data, while the top-left 
one shows the final model reconstruction obtained after a non-linear optimization involving the lens parameters and the substructure position and mass. The 
recovered source is plotted in the low-left panel. Image residuals (right) are shown. 
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Figure 6. Similar as Figure|5]for Li- 



a smoothness criterion. The choice of regularization can be modi- 
fied and the level of regularization is set by Bayesian optimization 
of the evidence. The data itself determine what level of regulariza- 
tion is needed. Hence overly smooth or overly irregular structure is 
automatically penalised. 

• The maximum-posterior and the full marginalized probabil- 
ity distribution function of all linear and non-linear parameters can 
be determined, marginalized over all other parameters (including 



regularization). Hence a full exploration of all uncertainties of the 
model is undertaken. 



• The full marginalized evidence (i.e. the probability of the 
model given the data) is calculated, which can be used to rank 
any set of model assumptions (e.g. pixel size, PSF) or model fami- 
lies. In our case, we intend to compare smooth models with models 
that include mass substructure. The marginalized evidence implic- 
itly includes Occam's razor and can be used to assess whether sub- 
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structure or any other assumption is justified, compared to a null- 
hypothesis. 

The method has been tested and calibrated on a set of artifi- 
cial but realistic lens systems, based on the lens system SLACS 
J1627-0055. 

The ensemble of mock data consists of a smooth PL lens and thir- 
teen clumpy models including one or two NFW substructures. Dif- 
ferent values for the mass and the substructure position have been 
considered. Using the Bayesian optimization strategy developed in 
this paper we are able to recover the smooth PL system and all per- 
turbed models with a substructure mass > 10 7 M o when located at 
the lowest surface brightness point on the Einstein ring and with 
a mass > 10 9 M o when located just inside or outside the ring (i.e. 
their Einstein rings need to overlap roughly). For all these models 
we have convincingly recovered the best set of non-linear param- 
eters describing the lens potential and objectively set the level of 
regularization. 

Furthermore, our implementation of the Nested-Sampling tech- 
nique provides statistical errors for all model parameters and al- 
lows us to objectively rank and compare different potential models 
in terms of Bayesian evidence, removing as much as possible any 
subjective choices. Any choice can quantitatively be ranked. For 
each of the lens systems we compare a complete smooth PL mass 
model with a perturbed PL+NFW (+NFW) one. The method here 
developed allows us to solve simultaneously for the lens potential 
and the lensed source. The latter, in particular, is reconstructed on 
an adaptive grid which is re-computed at every step of the optimiza- 
tion, allowing to take into account the correct number of degrees of 
freedom. 

In this paper we have considered systems which contains at most 
two CDM substructures. Although it may appear as a very small 
number when compared with predictions from N-body simulations 
within the virial radius, this represents a realistic scenario. As we 
have shown, our method, with current HST data, is mostly sensi- 
tive to perturbations with mass > 10 7 M Q and located on the Ein- 
stein ring (Ad ~ /jO er ). The projected volume that we are able to 
probe is therefore small compared to the projected volume within 
the virial radius. The probability that more than two substructures 
have this right combination of mass and position is relatively low 
and we expect most of the real systems to be dominated by one or 
at most two perturbers. Despite these new results, further improve- 
ments can still be made. We think, for example, that an adaptive 
source grid based on surface brightness, rather than magnification, 
or a combination, could be more suitable for the scientific problem 
considered here. 

The method will soon be applied to real systems, as for exam- 
ple from the Sloan Lens ACS Survey sample of massive early- 
type g alaxies dKoopmans et alj2006l : lBolton et al.l2006l : lTreu et al.l 
|2006|) . This will lead to powerful new constraints or limits on the 
fraction and mass distribution of substructure. Results will be com- 
pared with CDM simulations. 
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Figure 7. Similar as Figure [5]f or L[2. 
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Figure 8. Results of the linear source and potential reconstruction for the lens Li. The first row shows the original model (left), the reconstructed model 
(middle) and the current-best source, as well as the corresponding adaptive grid. On the second row the image residuals (left), the total potential convergence 
(middle) and the substructure convergence (right) are shown. Note that the substructure, although weak, is reconstructed at the correct position. 
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Figure 9. Similar as Figure[8]for Lo. We note that the substructure is extremely well reconstructed, both at the correct position and in mass. 
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Figure 10. Non linear reconstruction of the lens L13 for a single PL model, a PL+NFW and a PL+2NFW one. 
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Figure 11. Results of the first linear source and potential reconstruction for the lens L13. The first row shows the original model (left), the reconstructed model 
(middle) and the image residuals. On the second row the current-best source (left), the total potential convergence (middle) and the substructure convergence 
(right) are shown. Note that the substructure, although weak, is reconstructed at the correct position. 
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Figure 12. Results of the second linear source and potential reconstruction for the lens L13. 
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Figure 13. Posterior probability distributions for the non linear parameters of the smooth lens model Lo as obtained from the Nested-Sampling evidence 
exploration. In particular results for two different models are shown, a smooth PL potential (blue histograms) and a perturbed PL+NFW lens (black histograms). 
From up left, the lens strength, the position angle, the axis ratio, the slope, the logarithm of the source regularization constant, the substructure mass and position 
are plotted. 
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Figure 14. Similar as Figure [T3l for Li . 
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Figure 15. Similar as Figure [T3l for L2. 
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Table 3. Non-linear parameters for the mass model distribution. For each of the considered systems we report the true set r/true of non-linear parameters 
used to create the mock data (True), the best set (Best) recovered via the optimisation strategy described in Section 3.2 and the average with relative 
standard deviations values given by the Nested Sampling, under the hypothesis of a single power-law potential (PL) and of a perturbed PL+NFW lens. 



Lens Model b u h 6 a e f 07 q a q log(A s ) °~iog(\ s ) m sub °m sub ^sub cr^ stt6 Vsub a V sv ,b 

(arcsec) (arcsec) (deg) (deg) (10 10 Mq ) ( 10 10 Mq ) (arcsec) (arcsec) (arcsec) (arcsec) 



L 


True 


1 


.192 




12.23 




0.891 




0.540 
























Best 


1 


.162 




12.35 




0.873 




0.584 






-2.292 


















PL 


1 




0.008 


12.15 


0.394 


0.897 


0.005 


0.532 





012 


-1.619 


0.029 
















PL+NFW 


1 


.187 


0.005 


14.35 


1.907 


0.882 


0.001 


0.538 





006 


-2.221 


0.027 


0.019 


0.013 


1.220 


1.111 


1.103 


1.314 


u 


True 


1 


.192 




12.23 




0.891 




0.540 










0.001 




0.900 




1.190 






Best 


1 


.195 




11.87 




0.893 




0.545 






0.912 




0.001 




0.918 




1.174 






PL 


1 


.197 


0.004 


10.39 


0.253 


0.899 


0.002 


0.543 





005 


0.970 


0.039 
















PL+NFW 


1 


.20.-. 


0.002 


12.85 


0.530 


0.896 


0.001 


0.5321 





003 


0.029 


0.012 


0.004 


1.157 


0.019 


1.436 


0.014 






True 


1 


.192 




12.23 




0.891 




0.540 










0.010 




0.900 




1.190 






Best 


1 


213 




12.17 




0.896 




0.522 






3.563 




0.010 




0.917 




1.184 






PL 


1 


.188 


0.001 


17.81 


0.251 


0.905 


0.001 


0.553 





002 


1.187 


0.006 
















PL+NFW 


1 


.194 


0.004 


13.10 


0.303 


0.892 


0.003 


0.547 





005 


1.212 


0.025 


0.013 


0.001 


0.919 


0.008 


1.219 


0.011 


U 


True 


1 


.192 




12.23 




0.891 




0.540 










0.001 




-0.500 




1.000 






Best 


1 


.151 




11.46 




0.874 




0.596 






3.111 




0.001 




-0.502 




-0.916 






PL 


1 


.203 


0.008 


10.87 


0.156 


0.888 


0.004 


0.541 





009 


1.107 


0.038 
















PL+NFW 


1 


.177 


0.006 


10.90 


0.290 


0.877 


0.004 


0.567 





007 


1.104 


0.022 


0.0008 


0.0003 


-0.302 


0.096 


-0.633 


0.019 


U 


True 


1 


.192 




12.23 




0.891 




0.540 










0.100 




-0.100 




-0.600 






Best 


1 


.186 




11.76 




0.883 




0.559 






1.379 




0.103 




-0.105 




-0.595 






PL 


1 


.251 


0.001 


21.73 


0.018 


0.8831 


0.0005 


0.580 





001 


0.261 


0.004 
















PL+NFW 


1 


.215 


0.002 


11.85 


0.284 


0.9210 


0.0001 


0.516 





004 


0.358 


0.005 


0.9900 


0.0002 


-0.099 


0.001 


-0.607 


0.001 


L12 


True 


1 


.192 




12.23 




0.891 




0.540 










0.100 




-0.900 




-1.400 






Best 


1 


.188 




11.73 




0.887 




0.556 






2.831 




0.105 




-0.919 




-1.402 






PL 


1 


.154 


0.029 


1.752 


0.016 


0.881 


0.001 


0.598 





027 


0.948 


0.003 
















PL+NFW 


1 


.203 


0.001 


11.71 


0.297 


0.8841 


0.0003 


0.537 





002 


0.997 


0.007 


0.101 


0.001 


-0.906 


0.002 


-1.409 


0.002 



Table 4. Non-linear parameters for the mass model distribution for the system L13. We report the true set r]true of non-linear parameters used to create the mock data (True), the best 
set (Best) recovered via the optimisation strategy described in Section 3.2 and the average with relative standard deviations values given by the Nested Sampling, under the hypothesis 
of a single power-law potential (PL) and of a perturbed PL+2 NFW lens. 



Model 6 a b ag f 07 q a q T sh <j T Bh 6 sh ag sh log(A s ) °"log(A,,) m sub &m m b x sub a *, ub Vsub °y S ub 

(arcsec) (arcsec) (deg) (deg) (lO lo A/0) (1O 1O M0) (arcsec) (arcsec) (arcsec) (arcsec) 

True 1.192 12.23 0.891 0.540 0.000 0.000 0.0100 0.900 1.190 

0.0100 -0.500 -1.000 

Best 1.193 12.32 0.892 0.549 0.0001 0.0001 3.553 0.0100 0.910 1.190 

0.0100 -0.499 -1.006 

PL 1.182 0.012 12.31 0.022 0.867 0.010 0.580 0.016 -0.001 0.004 0.006 0.020 1.263 0.005 

PL+2NFW 1.195 0.001 12.32 0.002 0.894 0.015 0.548 0.001 0.0006 0.0002 0.0009 0.0017 1.268 0.003 0.0101 0.0003 0.910 0.002 1.189 0.001 

0.0101 0.0002 -0.499 0.001 -1.000 0.001 



